# VITA-VLA: Efficiently Teaching Vision-Language Models to Act via Action Expert Distillation

We are thrilled to announce the launch of VITA-VLA, a streamlined vision-language-action (VLA) model that combines architectural simplicity with efficient training. Designed to unlock the full potential of pretrained vision-language models in embodied AI tasks, VITA-VLA introduces a two-stage alignment and fine-tuning strategy that delivers strong performance with minimal computational overhead. In both simulation and real-world robotic scenarios, VITA-VLA delivers state-of-the-art results while remaining lightweight, modular, and reproducible.

✨ Highlights:

- **Two-Stage Alignment Strategy**: Hidden representation alignment with action models, followed by fine-tuning.
- **Efficient Training**: Freezes most of the VLM and uses a small expert model.
- **Compact and Strong**: Adds only a few lightweight components: a state encoder and an action query token.
- **Strong Performance**: Outperforms SOTA under constrained budgets on CALVIN ABC-D and LIBERO.
- **Open and Reproducible**: Based on public models and datasets.

---

## 🧠 Method Overview

**Overview of mainstream VLA architectures.**

1. **Discretization-based methods** convert actions into tokens and directly decode them using visual and language features, but omit **robot state information**, which is crucial for physical dynamics.
2. **Diffusion-based approaches** extract vision-language features with a VLM, but offload action generation to an action expert, making the VLM a passive feature extractor.
3. **Our method** introduces a **state encoder** and an **action query token**, retains the full VLM, and distills knowledge from an expert model, combining the VLM's reasoning ability with efficient action generation.
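
The alignment idea behind the third approach can be illustrated with a small sketch. It is not the VITA-VLA implementation: the linear projection, the mean-squared-error objective, and every name and number below (`project`, `alignment_loss`, the toy vectors) are assumptions chosen for illustration. The source only states that hidden representations are aligned with those of an action model before fine-tuning.

```python
from typing import List, Sequence

Vector = List[float]


def project(hidden: Sequence[float], weight: Sequence[Sequence[float]]) -> Vector:
    """Map a VLM hidden state to the expert's hidden size (hypothetical linear layer, no bias)."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def alignment_loss(student: Sequence[float], teacher: Sequence[float]) -> float:
    """Mean squared error between two equally sized vectors (assumed alignment objective)."""
    if len(student) != len(teacher):
        raise ValueError("student and teacher must have the same dimension")
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


# Toy values, not taken from the model.
vlm_query_hidden = [0.5, -1.0, 2.0]   # VLM hidden state at the action query token (dim 3)
expert_hidden = [1.0, 1.0]            # target hidden state from the action expert (dim 2)
projection = [[1.0, 0.0, 0.25],       # hypothetical 2x3 projection weights
              [0.0, 1.0, 0.5]]

projected = project(vlm_query_hidden, projection)
loss = alignment_loss(projected, expert_hidden)
print(projected, loss)
```

In a setup like this, the first stage would minimize the alignment term so that the action query token's representation matches what the expert expects, and the second stage would fine-tune on the action objective itself.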

---

## 📈 Benchmark Results

**CALVIN ABC-D**

Our model demonstrates strong zero-shot generalization to unseen environments, achieving higher overall performance than existing VLA models. This highlights both the effectiveness of our two-stage distillation strategy and the importance of fine-tuning the VLM for action execution.

**LIBERO-LONG**

Our model achieves state-of-the-art performance, with a 5.8% improvement over Seer-Large and a 1% improvement over the fine-tuning-only strategy. These results validate the effectiveness of our approach in handling long-horizon tasks and complex instruction-following scenarios.

**LIBERO**


Our model achieves the highest average success rate across all task suites, outperforming existing VLA models by a significant margin. In particular, it improves the previous best result on LIBERO-LONG by 24.5%, reaching a 97.3% success rate. These findings demonstrate that our framework effectively combines the reasoning capacity of large-scale VLMs with the efficient action modeling of small action models.

---

## 🌍 Real-World Experiment

### 🛠️ Task Setup

- Platform: **ALOHA**
- Control: 6-DoF arm + gripper width
- Tasks: Pick, Place, Close, Stack
- Dataset: 500 demos (100 per task)
- Gripper supervision: **L1 loss (weight=1000)**
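
To show how the gripper weighting in the setup above might enter a training objective, here is a minimal sketch. Only the L1 gripper term and its weight of 1000 come from the setup. The action layout (six arm values followed by gripper width), the use of L1 for the arm term, and the function names are assumptions made for illustration.

```python
from typing import Sequence

GRIPPER_WEIGHT = 1000.0  # weight stated in the task setup


def l1(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute error over paired values."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def action_loss(pred: Sequence[float], target: Sequence[float],
                gripper_weight: float = GRIPPER_WEIGHT) -> float:
    """Combine an arm term with a heavily weighted gripper-width term.

    Assumed layout: indices 0-5 are the 6-DoF arm command, index 6 is gripper width.
    The arm term uses L1 here purely as a placeholder; the source does not specify it.
    """
    if len(pred) != 7 or len(target) != 7:
        raise ValueError("expected 7-dimensional actions (6-DoF arm + gripper width)")
    arm_term = l1(pred[:6], target[:6])
    gripper_term = abs(pred[6] - target[6])
    return arm_term + gripper_weight * gripper_term


# Toy example: every arm value is off by 0.5 and the gripper width is off by 0.25.
pred = [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5]
target = [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.25]
total = action_loss(pred, target)
print(total)
```

With a weight this large, even a small gripper-width error dominates the total, which is one plausible reason to weight the gripper so heavily: a gripper that opens or closes slightly wrong fails the grasp regardless of arm accuracy.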

### 🎬 Execution Demo

**Instructions:**

1. "Close the drawer"
2. "Stack the orange cup on top of the green cup"
3. "Stack the red block on top of the yellow block"
4. "Pick up the sponge and put it into the basket"
5. "Pick up the red block and put it into the basket"

### 📊 Real-World Results Summary


- VITA-VLA achieves top performance across all tasks
- Strongest in long-horizon and stacking scenarios
- Validates two-stage training strategy

---
